ESSNet on Statistical Disclosure Control
Contact: Peter-Paul de Wolf, Statistics Netherlands, P.O. Box 24500, 2490 HA The Hague, The Netherlands. Phone: +31 70 337 5060. Last update: 10 Oct 2011
Task 5. Improvement of software for micro data
5.a. Big surveys

5.a(1) Standardised anonymisation of microdata sets

This task relates to both 2.a and 2.b. The idea is that the creation of anonymised microdata files for researchers should be integrated into the standardised process of producing statistics in the future, particularly for annual surveys, where such standardisation would bring considerable advantages to data producers and the research community.

We will test µ-ARGUS in order to establish whether it can be used as an instrument for the standardised anonymisation of big microdata sets and to identify any problems with integrating the software into the production infrastructure. This question will be investigated using the example of the German microcensus, a 1% sample of the German population (about 1.2 million records).

Partners: DE
Deliverables: A list of proposals for improvements to the software.

5.a(2) Blocking methods

Some SDC methods for microdata protection become increasingly time consuming when very large surveys have to be protected. Some SDC microdata methods take linear time in the data set size, while others (like microaggregation or optimal recoding) take quadratic or, more generally, superlinear time. Disclosure risk assessment for microdata is also often superlinear (e.g. record linkage is quadratic).

Blocking is a popular approach to applying superlinear methods to large microdata sets. The idea is to split large data sets into smaller pieces (blocks) of manageable size that can be treated separately in a reasonable time. Blocking should be done in such a way that its impact on data utility is as small as possible. The usual blocking procedure involves selecting a number of variables in the data set, called blocking variables; records are then sorted by the blocking variables and the sorted data set is divided as many times as needed to obtain manageable subsets. Normally, a block is defined as the subset of records sharing a particular combination of values of the blocking variables (see the sketch below). Building blocks from blocking variables has several drawbacks; these motivate the cluster-based blocking mechanism announced in the deliverables.
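As an illustration only, here is a minimal Python sketch of the usual variable-based blocking procedure described above; the file name, the column names region and sex, and the block-size threshold are hypothetical choices, not part of the workplan.

    from collections import defaultdict
    import csv

    def block_by_variables(records, blocking_vars, max_block_size=10000):
        """Partition records into blocks that share a combination of
        values of the blocking variables; blocks larger than
        max_block_size (an arbitrary threshold) are cut into chunks."""
        blocks = defaultdict(list)
        for rec in records:
            key = tuple(rec[v] for v in blocking_vars)
            blocks[key].append(rec)
        # Each chunk is small enough for a superlinear SDC method
        # (e.g. microaggregation) to be applied to it separately.
        result = []
        for key in sorted(blocks):
            recs = blocks[key]
            for i in range(0, len(recs), max_block_size):
                result.append(recs[i:i + max_block_size])
        return result

    # Hypothetical usage with 'region' and 'sex' as blocking variables.
    with open("microdata.csv", newline="") as f:
        records = list(csv.DictReader(f))
    blocks = block_by_variables(records, ["region", "sex"])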
Partners: URV, NL (CBS), DE. URV for the report, NL for the implementation in ARGUS and DE for the testing.
Deliverables: A report in the first year and an implementation of a cluster-based blocking mechanism in the second year.

References:
W. Cohen and J. Richman (2002), Learning to match and cluster high-dimensional data sets for data integration, in Proceedings of ACM SIGKDD 2002.
A. McCallum, K. Nigam and L. Ungar (2000), Efficient clustering of high-dimensional data sets with application to reference matching, in Proceedings of ACM SIGKDD 2000, pp. 169-178.
A. Solanas, A. Martínez-Ballesté, J. M. Mateo-Sanz and J. Domingo-Ferrer (2006), A 2d-tree-based blocking method for microaggregating very large data sets, in Proceedings of ARES/DAWAM 2006, IEEE Computer Society, pp. 922-928.

5.b. Alternative risk models or microdata dissemination strategies

Microdata users are very diverse, and an efficient data dissemination strategy should take full account of this heterogeneity, as has already been made clear in the literature. The distinction between Public Use Files and Microdata Files for Research starts from different scenarios, different risk models, possibly different masking procedures, and different data utility / information loss requirements.

For social microdata, masking and risk/utility assessment for Public Use Files and Microdata Files for Research will be further investigated. Survey-specific data dissemination strategies for enterprise microdata will also be investigated. A unified framework for risk assessment and data protection will be developed, and protection methods that take information loss into account will be analysed further.

Partners: IT, UK
Deliverables: At the end of each year, one report will illustrate the progress made for social surveys, with examples on real surveys, and one report will cover risk models and data protection for enterprise microdata.

5.c. Record linkage

Roughly speaking, record linkage consists of linking each record a in file A (the protected file) to a record b in file B (the original file). The pair (a, b) is a match if b turns out to be the original record corresponding to a. To use this method to measure the risk of identity disclosure, it is assumed that an intruder has an external dataset that shares some (key or outcome) variables with the released protected dataset and contains some identifier variables (e.g. passport number, full name, etc.). The intruder is assumed to try to link the protected dataset with the external dataset using the shared variables. The number of matches gives an estimate of the number of protected records whose respondents can be re-identified by the intruder. Accordingly, disclosure risk is defined as the proportion of matches among the total number of records in A.

There are two main types of record linkage: distance-based record linkage and probabilistic record linkage. Within the ESSNet, we propose to implement distance-based record linkage in µ-ARGUS. Distance-based record linkage consists of linking each record a in file A to its nearest record b in file B, so the method requires a distance function expressing nearness between records. This record-level distance can be constructed from distance functions defined at the variable level; doing so requires standardising variables to avoid scaling problems and assigning each variable a weight in the record-level distance (see the sketch below).
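As an illustration only, the following minimal Python sketch implements distance-based record linkage on numerical variables, with the standardisation and per-variable weights described above; it is not the CASC software, the synthetic data are made up, and the equal weights are an assumption.

    import numpy as np

    def distance_based_record_linkage(A, B, weights=None):
        """Link each record in A (protected file) to its nearest record
        in B (original file) using a weighted Euclidean distance on
        variables standardised to zero mean and unit variance."""
        A = np.asarray(A, dtype=float)
        B = np.asarray(B, dtype=float)
        if weights is None:
            weights = np.ones(A.shape[1])   # equal weights (assumption)
        # Standardise both files with the original file's moments
        # to avoid scaling problems between variables.
        mu, sigma = B.mean(axis=0), B.std(axis=0)
        sigma[sigma == 0] = 1.0
        As, Bs = (A - mu) / sigma, (B - mu) / sigma
        # Pairwise weighted squared distances: quadratic in file size,
        # which is why blocking (5.a(2)) matters for large surveys.
        d = ((As[:, None, :] - Bs[None, :, :]) ** 2 * weights).sum(axis=2)
        return d.argmin(axis=1)             # nearest b for each a

    # Synthetic example: A is a noise-masked version of B, so record i
    # of A corresponds to record i of B and a link is a match iff it
    # points back to i. Disclosure risk = proportion of matches in A.
    rng = np.random.default_rng(0)
    B_original = rng.normal(size=(200, 3))
    A_protected = B_original + rng.normal(scale=0.1, size=(200, 3))
    links = distance_based_record_linkage(A_protected, B_original)
    risk = np.mean(links == np.arange(len(links)))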
Partners: URV, NL, DE
Deliverable: Software for distance-based record linkage produced within the CASC project (deliverable 1.2 D6) will be included in µ-ARGUS during the first year. The appropriate documentation will be included in the µ-ARGUS manual.